Statistical Language Modelling

Authors

  • Yoshihiko Gotoh
  • Steve Renals
Abstract

Grammar-based natural language processing has reached a level where it can ‘understand’ language to a limited degree in restricted domains. For example, it is possible to parse textual material very accurately and assign semantic relations to parts of sentences. An alternative approach originates from the work of Shannon over half a century ago [41], [42]. This approach assigns probabilities to linguistic events, using mathematical models to represent statistical knowledge. Once the models are built, we decide which event is more likely than the others according to their probabilities. Although statistical methods currently use a very impoverished representation of speech and language (typically finite state), the underlying models can be trained from large amounts of data. Importantly, such statistical approaches often produce useful results. They seem especially well suited to spoken language, which is often spontaneous or conversational and not readily amenable to standard grammar-based approaches. This chapter concerns statistical language modelling. In a speech recognition system, the role of the language model is to assign probabilities to word sequences. Recently, models similar to speech recognition language models have been employed to perform higher-level tasks, such as structuring and extracting information from spoken language. In this chapter, we first outline the basic framework of n-gram language models (section 2), which form the core of current statistical approaches. A crucial technical consideration here is how to estimate n-gram statistics from sparse training data. We go on to describe two approaches, both based on n-gram models, that capture varying content and style: section 3 is concerned with mixture language models, and section 4 builds on the observation that the occurrence rate of a word is not uniform but varies between documents. Finally, we describe a statistical finite state model for the extraction of information, such as proper names and dates, from spoken language.
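As a rough illustration of what it means for a language model to assign probabilities to word sequences, and of why sparse training data matters, the following is a minimal sketch of a bigram (n = 2) model. It is not taken from the chapter: the function names, the toy corpus, and the choice of add-one smoothing are all assumptions made for illustration, and add-one smoothing is only one simple way of handling unseen n-grams.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Collect context and bigram counts from tokenized sentences.

    Sentence-boundary markers <s> and </s> are an illustrative convention.
    """
    context_counts = Counter()  # how often each word occurs as a left context
    bigram_counts = Counter()   # how often each (previous, current) pair occurs
    vocab = set()               # every token that can be predicted
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing (an assumed, very simple estimator).

    Unseen bigrams receive a small non-zero probability instead of zero.
    """
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))


def sentence_logprob(words, context_counts, bigram_counts, vocab):
    """Log-probability of a word sequence as a sum of bigram log-probabilities."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log(bigram_prob(prev, cur, context_counts, bigram_counts, vocab))
        for prev, cur in zip(tokens, tokens[1:])
    )


# Toy corpus, purely hypothetical.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(sentence_logprob(["the", "cat", "sat"], *model))
print(sentence_logprob(["sat", "cat", "the"], *model))
```

With this toy corpus, the word order seen in training scores higher than the scrambled one, while the scrambled sequence still receives a finite log-probability because smoothing reserves some probability mass for unseen bigrams.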

Publication date: 2000